Training linear ranking SVMs in linearithmic 
time using red-black trees 
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Abstract 

We introduce an efficient method for training the linear ranking sup- 
port vector machine. The method combines cutting plane optimization 
with red-black tree based approach to subgradient calculations, and has 
0(ms + Tnlog(m)) time complexity, where m is the number of training 
examples, and s the average number of non-zero features per example. 
Best previously known training algorithms achieve the same efficiency 
only for restricted special cases, whereas the proposed approach allows 
any real valued utility scores in the training data. Experiments demon- 
strate the superior scalability of the proposed approach, when compared 
to the fastest existing RankSVM implementations. 



1 Introduction 

Learning to rank has been a task of significant interest during tfie recent years. 
The ranking problem has been largely motivated by applications in areas such 
as web search and recommender systems. Due to the large amounts of data 
available in these domains, it is necessary for the used algorithms to scale well, 
preferably close to linear time methods are needed. For a detailed introduc- 
tion to the topic of learning to rank, we refer to 



Hiillermeier 2011) 



(Liu 2009 



Fiirnkranz and 



In this work we assume the so-called scoring setting, where each data instance 
is associated with a utility score reflecting its goodness with respect to the 
ranking criterion. A successful approach for learning ranking functions has 
been to consider pairwise preferences (Fiirnkranz and Hiillermeier 2005). In 
this setting, the aim is to minimize the number of pairwise mis-orderings in 
the ranking produced when ordering a set of examples according to predicted 
utility scores. A number of machine learning algorithms optimizing relaxations 
of this criterion have been proposed, such as the RankBoost (Freund et al 



[2003 j , RankNet ( [Surges et aL||2005J , RankRLS (Pahikkala et al.^ 2007n2009[ ) 
and the subject of this study, the ranking support vector machine (RankSVM) 
algorithm (Herbrich et al 



1999l|Joachims";2002'). 



The original solution proposed for RankSVM optimization was to train a 
support vector machine (SVM) classifier on pairs of data examples. Adapting 
standard dual SVM solvers to RankSVM training leads to 0{ra^) scaling or 



worse, with respect to the training set size m (Bottou and Lin 2007). While 



linear RankSVMs can be trained more efficiently by solving the primal opti- 



mization problem for SVMs (Chapelle 2007 Chapelle and Keerthi 2010), still 



the complexity of any method that is explicitly trained on all the pairs has at 
the very least quadratic dependence on the number of training examples. 

Joachims (2005 20061 has shown that linear RankSVM can be trained 
using cutting plane optimization, also known as bundle optimization, much 



more efficiently in certain special settings. Joachims (2005) has proposed an 
0{ms + m\og(rn)) time algorithm, where m is the number of training examples, 
and s the average number of non-zero features per example, for the bipartite 
ranking problem, where only two utility levels are allowed. The bipartite rank- 
ing problem corresponds to maximizing the area under the receiving operating 



characteristic curve (AUG) ( Hanley and McNeil 1982 ) , a performance measure 



widely used in machine learning (see e.g. (Bradley 1997 Provost et al. 1998)). 
An 0{ms + mlog{m) -j- rm) time generalization of the method has been pre- 
sented for the case, where r different utility levels are allowed fJoachims 2006). 
Chapelle and Keerthi ( 2010[ ) have recently further explored efficient methods 
for training RankSVM, the proposed methods have similar scaling. 

If r is assumed to be a small constant, the existing methods are compu- 
tationally efficient. This is for example the case in the bipartite ranking case 
where there are only two utility levels, corresponding to the "good" and the 
"bad" objects. Similarly, movie ratings ranging from one to five stars could be 
encoded using r — 5 distinct utility levels. However, in the general case where 
unrestricted scores are allowed, if most of the training examples have different 
scores r « to leading to 0{ms + m?) complexity. This worst scale quadratic 
scaling with respect to the training set size limits the applicability of RankSVM 
to large scale learning. 

In this work we generalize the work of Joachims (12006) and present an 
0{ms + ■m\og{m)) time training algorithm for linear RankSVM, that is ap- 
plicable in the most general case, where arbitrary real-valued utility scores are 
allowed in the training data. The method is based on using modified red-black 



tree data structures (Bayer 1972 Cormen et al. 2001 1 for speeding up the eval- 



uations needed in the optimization process. Computational experiments show 
that the method has excellent scalability properties also in practice. 

In Section [2] we introduce the general learning to rank setting, and in Sec- 
tion |3] we formalize the regularized risk minimization problem, and present the 
general optimization framework for solving it. In Section HI we present our main 
contribution, the efficient RankSVM subgradient and loss computation algo- 
rithms, and in Section [5] we present an experimental evaluation of the resulting 
training algorithm. We conclude in Section [6] 



2 Learning setting 



Let D be a probability distribution over a sample space Z = M" x R. An example 
z = (x, y) G -Z is a pair consisting of an n-dimensional column vector of real- 



valued features, and an associated real-valued utility score. Let the sequence 
Z = ((xi, yi), . . . , (xm, J/m)) G Z™- drawn according to Z? be a training set of m 
training examples. X S K"^'" denotes the n x m data matrix whose columns 
contain the feature representations of the training examples, and y € R™ is a 
column vector containing the utility scores in the training set. Our task is to 
learn from the training data a ranking function / : M" -^ M. In the linear case 
such a function can be represented as /(x) = w"'"x, where w € M" is a vector 
of parameters. 

The difference between ranking and regression is that in ranking, the actual 
values taken by the prediction function are typically not of interest. Rather, 
what is of interest is how well the ordering acquired by sorting a set of new 
examples according to their predicted scores matches the true underlying rank- 
ing. This is a reasonable criterion for example in the web search engines and 
recommender systems, where the task is to choose a suitable order in which to 
present web pages or products to the end user. A popular way to model this 
criterion is by considering the pairwise preferences induced by a ranking (see 



e.g. (Fiirnkranz and Hiillermeier 2005)). We say that an example Zi is preferred 



over example Zj, if j/i > yj. In this case one would require from the ranking 
function that /(x^) > /(xj). 

The performance of a ranking function can be measured by the pairwise 
ranking error defined as 



1 ^[/(x,)>/(x,)], 



(1) 



Vi <Vj 



where TV is the number of pairs for which yi < yj holds true. The equation (nj) 
simply counts the number of swapped pairs between the true ranking and the 
one produced by /. 

By restricting the allowed range of utility scores we can recover some popular 



special cases of the introduced setting. In ordinal regression (Herbrich et al 



1999 Waegeman et al. 2008 ) it is assumed that there exists a finite, often quite 



small set of possible discrete ordinal labels. For example, movie ratings ranging 
from one star to five stars would constitute such a scale. In the bipartite ranking 
task where only two possible scores are allowed equation (n]) becomes equivalent 
to the Wilcoxon-Mann- Whitney formula used to calculate AUG (Hanley and 



McNeil 1982 Cortes and Mohri 2004) 



In some learning to rank settings instead of having a total order over all 
examples, the sample space is divided into disjoint subsets, and pairwise pref- 
erences are induced only from pairwise comparisons between the scores of ex- 
amples in the same subset. An example of an application settings where this 
approach is commonly adopted is document retrieval, where data consists of 
query-document pairs, and the scores represent the utility of the document 



with respect to the associated user query (Joachims 2002). Preferences are in- 



duced only between query-document pairs from the same query, never between 
examples from different queries. In such settings we can calculate (fTj) separately 
for each subset, and take the average value as the final error. 



Minimizing ([!]) directly is computationally intractable, successful approaches 
to learning to rank according to the pairwise criterion typically minimize convex 
relaxations instead. The relaxation considered in this work is the pairwise hinge 
loss, which together with a quadratic regularizer forms the objective function of 
RankSVM. Before formally defining the loss, we introduce a general optimiza- 
tion method suitable for minimizing it. 

3 Bundle Methods for Regularized Risk Mini- 
mization 

A large class of machine learning algorithms can be formulated as the uncon- 
strained regularized risk minimization problem 



argmin J('w), 

wSK" 



(2) 



where 



J(w) = i?emp(w) + A||w|| 



w is the vector of parameters to be learned, Remp is the empirical risk measuring 
how well w fits the training data, ||w|p is the quadratic regularizer measuring 
the complexity of the considered hypothesis, and A G K"*" is a parameter. We 
assume that Remp '■ K" — > M is convex and non-negative. 

Different choices of Remp result in different machine learning methods such 



as SVM classification (Cortes and Vapnik 19951 or regression (Drucker et al 



19971, regularized least-squares regression (Poggio and Girosi, 1990), structured 



output prediction methods ( Tsochantaridis et al. 20051), RankRLS (Pahikkala 



et al. 2007 2009) and RankSVM (Herbrich et al. 1999 Joachims 2002). 



Bundle methods for regularized risk minimization (BMRM) ( Teo et al. 2007 



Smola et al. 2007 Teo et al. , 20101, is a general and efficient optimization 



technique for solving ([2|) . The method is also known as the cutting plane method 
in the machine learning literature, and it was under this name it was first 
introduced for the purpose of efficiently optimizing large margin type of loss 
functions ( Tsochantaridis et al. 2005 1 . It was later shown, that the method 



can be generalized to arbitrary convex loss functions, as long as subgradient 



evaluations for the loss can be done efficiently (Teo et al. 2007 Smola et al 



2007 1 . In the following we adopt our notation and terminology from the BMRM 



literature. 

BMRM iteratively constructs a piecewise linear lower bound approximation 



of R. 



■emp- 



Let Rt be the piecewise linear approximation at iteration t. We 



approximate (pi) with 



Wt 



argmin Jt("w), 

weK" 



(3) 



where 



Jt(w)-i?t(w) + A|lw|' 



Thus the regularizer remains the same, but the empirical risk term is replaced 
with the piecewise linear lower bound. 

The empirical risk is lower bounded by its first order Taylor approximation 
at any w' £ M", defined as 

i?emp(w) > i?emp(w') + (w - w',a'), 

where a' is any subgradient of Remp at w'. By defining 6' — Rempi"^') — ('w', a') 
due to the linearity of the inner product this can be re-written as 

i?emp(w) > (w,a')+6', 

(w, a') + 6' = is called a cutting plane. 

Using several cutting planes BMRM approximates Remp with 

i?t(w) = max{(w,ai) +6J}. 

i—l...t 

Using this approximation ([3| can be solved by solving an equivalent quadratic 
program whose size depends on the number of cutting planes used. 

Algorithm 1: BMRM 
Input: Wo, e > 
Output: Wh 

1 f ^0; 

2 Wb <- Wo; 

3 repeat 

4 t 4" t-1- 1; 

5 at -i— subgradient of Remp at "Wt-i; 

6 bf^ i?emp(Wt_i) - (wt_i, ai); 

7 Update Rti'w) by adding the new cutting plane {-,^.1) + bt] 

8 wt ^ argmin^ Ji(w); 

9 if J{wt) < J(wf,) then 

10 I Wf, -(— Wt 

11 end 

12 et ^ J(wf,) - Jt(wt) 

13 until et < e; 



The procedure according to which BMRM builds the lower bound is de- 
scribed in algorithm [T] The formulation given here differs slightly from that 
of Teo et al. (2010[) in that following the suggestion of Franc and Sonnenburg 



(2009) we maintain the best this far seen solution w^h, and only update the so- 
lution when the new solution w^ is better. The parts of the algorithm which 
depend on the choice of Remp are the calculation of a subgradient and the value 
of Remp at point Wt_i, as well as the termination criterion. 

The rate of convergence is for BMRM independent of the training set size 



( Smola et al. , 2007 ) . The size of the quadratic program solved on line 8 does grow 



with the number of iterations, but its computational cost is on large datasets 
insignificant compared to the cost of computing the cutting plane. Thus what is 
required for efficient learning with a convex loss function is an efficient algorithm 
for computing its value and subgradient. 



4 Efficient computation of loss and subgradient 

Next we present an efficient algorithm for evaluating the empirical risk and its 
subgradient for RankSVM. First, in Section |4J] we recall results from | Joachims] 
( 2006 1 to identify the computational bottleneck in RankSVM training, which oc- 



curs in the loss and subgradient computation. Next, in Section 4.2 we introduce 
search tree algorithms, which we use to speed up these computations. Build- 
ing on these results we present an efficient algorithm for loss and subgradient 
computation in Section [4. 3| 



4.1 Preliminaries 



The following results were first presented by Joachims (2006), who formulated 



the RankSVM optimization problem as a constrained optimization problem. 
In this work we follow a different but equivalent formulation of RankSVM as 



an unconstrained optimization problem within the BMRM framework (see Teo 
eFaL](|20T0l)). 



The average pairwise hinge loss computed over the training set, with respect 
to a given solution w is 



yi<Vj 



T 
W Xi 



W Xj) 



■3h 



(4) 



where N is the number of pairs for which y^ < j/j holds true. 

The pairwise hinge loss, together with the quadratic regularizer forms the 
objective function of RankSVM. The most obvious approach to evaluating Q, 
and also its subgradient involves going explicitly through all the pairs in the 
training set. However, this would lead to 0(m") complexity, which is inefficient 
on large data sets. Let us define 



\{j : iy^ < Vj) A i 



W Xi > W X, 



1) A (1 < j < m)}\ 



(5) 



and 



d^ = \{j ■■ {y^ > Vj) A (w^x, < w^x, + 1) A (1 < J < m))\. (6) 

Using the frequencies ^ and (|6|, we recover an alternative formulation for 
the empirical risk. 



Lemma 1 (Joachims (2006); Teo et al. (2010)). The average pairwise hinge 
loss Op can he equivalently expressed as 



N ^-^ 

1=1 



((C, -rfi)w'^Xi +Ci). 



(7) 



Similarly, computation of ([5]) and (|6| allows the computation of a subgradient 
of the empirical risk. 



Lemma 2 (Joachims (20061; Teo et al. (2010)). A subgradient of (Op can be 
expressed as 



Vi?e,„p(w) ^ l^^{Ci- di)x. 



(8) 



Inner product evaluations, scalar-vector multiplications and vector summa- 
tions are needed to calculate (ItI) and (|8|. These take each 0{s) time, the average 
number of nonzero elements in a feature vector. Provided that we know the val- 
ues of Ci, di and N, both the loss and subgradient can thus be evaluated in 
0{ms) time. 



[Joachims (2006) (and equivalently Teo et al. (20101) describe a way to cal- 
culate efficiently these frequencies, and subsequently the loss and subgradient. 
However, the work assumes that the range of possible utility score values is re- 
stricted to r different values, with r being quite small. The algorithm requires 
0{r) passes through the training set, contributing a 0{rm) term to the overall 
complexity, which is 0{ms + m log(TO) -I- rm). If the number of allowed scores is 
not restricted, at worst case r = m with the resulting complexity 0{ms -\- rn^), 
meaning quadratic behavior with respect to the training set size. However, as 
we show next, the dependence on r can be removed from the algorithm by uti- 
lizing order statistics trees, resulting in 0{ms + mlog{m)) cost also in the most 
general case, where arbitrary real valued utility scores are allowed. 



4.2 Order statistics tree 



The order statistics tree ( Gormen et al. 2001 ) is a balanced binary search tree. 



which has been augmented to support logarithmic time computation of order 
statistics, such as the rank of a given element in the tree, or the recovery of the 
fc:th element in the tree. As we will show in the following, the data structure 
also allows efficient computation of the frequencies needed in the RankSVM 
loss and subgradient computations. Next, we introduce the order statistics tree, 
recall its main properties, and introduce algorithms which act as building blocks 
for the fast RankSVM training method. In the following, we assume that the 
number of elements inserted to the search tree is bounded by m. 

The binary search tree is one of the most fundamental data structures in 
computer science. It is a linked data structure consisting of nodes. A given 
node X contains a real valued key(x), and pointers to a parent node par(a;), a 
left child left(a;), and a right child right (x). A parent or a child may be missing, 
denoted by the value 0. Only a single node known as the root node is allowed 
to have missing parent. Nodes with no children are called leaf nodes. The root 
of an empty tree is assumed to be 0. The height of a binary search tree is the 
length of the path from the root node to the lowest leaf node, and the size of a 
tree is the number of elements it contains. We call the tree rooted at left(a;) the 
left subtree of x, and the tree rooted at right(a;) the right subtree of x. The tree 



always satisfies the binary search tree property, requiring that the nodes stored 
in the left subtree of a node have smaller than or equal keys as the node, and 
the nodes in the right subtree have larger than or equal keys as the node. 

The worst-case cost of searching for an element in the tree, or inserting or 
deleting an element is proportional to the height of the tree. To guarantee 
the efficiency of such operations, self-balancing binary trees combine basic tree 
update operations with maintenance operations that maintain 0(log(m)) tree 
height. One of the most popular self-balancing search tree variants is the red- 
black tree (Bayer 1972 Gormen et al. 2001). Further, 



describe a modified variant of the red-black tree defined as follows: 



(Gormen et al. 2001 



Definition 1 (Order statistics tree ( Gormen et al. (2001))). The order statistics 



tree is a self-balancing binary search tree, with the following properties 

• The binary search tree property: "Let x be a node in a binary search tree. 
If y is a node in the left subtree of x, then key(j/) < key (a;). If y is a node 
in the right subtree of x, then key(a;) < key(2/)." 

• Balance: 0(log(rn)) height after arbitrary insertions and deletions. 

• Each node of the order statistics tree stores an additional attribute size 
defined as 



size(a::) 



ifx^% 

size (left (x)) + sizc(right(a;)) + 1 otherwise 



The correct value for this attribute is maintained after arbitrary insertions 
and deletions. 

Note that the definition allows the existence of multiple nodes with the same 
key value in the tree. 

Let Tree-Insert (T, x) be the insertion operation of an arbitrary node x to an 
order statistics tree T, whose size is bounded by m. Tree-Insert adds the new 
node to the tree, maintaining the binary search tree property, the correct values 
of the size attribute in the nodes, and the 0(log(m)) height of the tree. The 
time complexity of the operation is characterized by the following Lemma: 



Lemma 3 (Gormen et al. ( |2001 )). The time complexity o/ Tree-Insert to order 
statistics tree is Oi\og(m)). 

Further, for the RankSVM computations we require routines that efficiently 
compute the number of elements in the tree with a smaller, or a larger value than 
a given argument. Let fc be a real value, and let T be an order statistics tree, 
whose size is bounded by m. Then, Gount-Smaller(root(r), fc) (Algorithm l2| 
returns the number of nodes in T with a smaller key value than fc. The al- 
gorithm Gount-Larger(root(r), fc), which is not presented separately, works in 
analogous fashion returning the number of nodes with a larger key value than 
the argument. 



Lemma 4. The correctness of Gount-Smaller and Gount-Larger. 



Proof. Let C{x, k) denote the number of values smaller than k stored in a binary 
search tree whose root x is. Due to Definition [l] the following always holds 

r ifa;==0 

C{x, k) = < C(right(a;), fc) + size(left(a;)) + 1 if key(a;) < k . 
[ C{leit{x),k) otherwise 

This recursive equation supplies us directly with an algorithm for computing 
C(a;, fc), which is implemented in Count-Smaller. The proof for Count-Larger is 
analogous. D 

Lemma 5. Count-Smaller and Count-Larger have 0(log(?7i)) complexity. 

Proof. On each call of Count-Smaller or Count-Larger, 0(1) cost computations 
are performed, and additionally the routine may call itself once with a child 
node of the input node. At worst case the recursion proceeds until the lowest 
leaf node in the tree is reached, requiring a number of calls proportional to the 
height of the tree, which is according to Definition [l] guaranteed to be of the 
order 0(log(m)). D 



Algorithm 2: Count-Smaller 
Input: X, k 

1 if X = then return 0; 

2 else if key(x) < k then return 

Count-Smaller(right(2;), k) -\- size(left(a;)) 

3 return Count-Smallcr(left(x), fc); 



Let r be the number of distinct keys stored in the tree. In the presence of 
a large number of duplicates, meaning r << m, the order statistics tree can 
be implemented more efficiently by storing duplicate keys to the same node. In 
such an implementation, each node contains an additional attribute nodesize(x), 
which measures how many times key(x) has been inserted to the tree. When in- 
serting a new key, a new node is not created if the key already exists in the tree, 
but rather the nodesize attribute of the existing node is incremented by one. 
Thus, we need to re-define size(a;) = size(left(a;)) -|-size(right(a;)) -I- nodesize(a;). 
For this modified variant of the order statistics tree, the height of the tree, 
and the cost of Tree-Insert, Count-Smaller and Count-Larger are bounded by 
0(log(r)). However, this improvement does not translate into further improve- 
ments in the asymptotic cost of RankSVM training, due to 0(TOlog(TO)) cost of 
a sorting operation that is required on each iteration of training. 

4.3 Subgradient computation 

Algorithm p^ presents the main contribution of this paper, an 0{ms -\- m\og{m)) 
time method for calculating the RankSVM loss and subgradient. The algorithm 
uses order statistics trees to efficiently compute the frequencies (l5| and (l6]). 



Algorithm 3: Subgradient and loss computation 



Input: X, y, w, N 
Output: a, loss 

1 p 4- X'^w; 

2 c 4— 771 length column vector of zeros; 

3 d 4— TO length column vector of zeros; 

4 TT -s— training set indices, sorted in ascending order according to p; 

5 r 4— new empty search tree; 

6 j ^ 1; 

7 for z 4— 1 to ?Ti do 



8 while (j < to) and (p[7r[z]]] > p[7r[j]] — 1) do 

9 Tree-Insert(T,y[7r[j]]); 

10 j ^ j + 1; 

11 end 

12 c[7''W] ^^ Count-Largcr(root(T),y[7r[i]]); 

13 end 

14 T <(— new empty search tree; 

15 j -s— to; 

16 for z 4— TO to 1 do 

17 while (j > 1) and (p[TT[i]] < p[Tr[j]] + 1) do 

18 Tree-Insert(T, y[7r[j]]); 

19 j ^ j - 1; 

20 end 

21 d[7r[«]] ^ Count-Smaller(root(T),y[7r[i]]); 

22 end 

23 loss^ i E" i((cW - d[i]) * p[i] + c[i\); 

24 a ^ iX(c-d); 



10 



which are necessary in computing the value of the loss ([7|) and a subgradient 
(§. 

Theorem 1. The correctness of Algorithm^ 

Proof. The algorithm computes the predicted utility scores for the training set 
with p — X^-w (note that p[z] — w"'"xi). Next, an index list tt is created, where 
the indices of the training examples are ordered in an increasing order, according 
to the magnitudes of their predicted scores, so that p[7r[l]] < p[7r[2]] < . . . < 
p[7rH]. 

Let us consider the i:th iteration of the for-loop on lines 7— 11, with 1 < i < 
m. After the while loop on lines 8—10 has been executed, the index j divides 
the training set into two parts. For 1 < fc < j it holds that p[7r[i]] > p[7r[A;]] — 1 
and for j < fc < m it holds that p[7r[i]] < p[7r[fc]] — 1. The keys corresponding 
to the indices 7r[l] ...7r[j — 1] have, either on this or on previous iterations, 
been inserted to the order statistics tree T, by the Tree-Insert call on line 9. 
Therefore, on line 11 where Count-Larger (root (r),y[7r[z]]) is called, T contains 
the labels of the training examples indexed by the set 

{k : (p[7r[z]] > p[fc] - 1) A (1 < fc < m)} 

According to Lemma pj Count-Larger(root(T),y[7r[z]]) returns the number of 
keys in T, with a larger value than the argument y[7r[i]]. Thus, line 11 stores to 
c[7r[i]] the value 

\{k : (y[7rH] < y^) A (p[^W] > p[fc] - 1) A (1 < fc < m)}|, 

which, according to ([5| is the frequency c^^i^ 

It can be verified analogously, that the for-loop on lines 14 — 18 computes 
the frequencies di...dm according to Q. After line 18 has been executed, 
the algorithm has thus filled two arrays, c = [ci , . . . , c,„] and d = [di , . . . , dm] ■ 
Using these frequencies, the loss is computed on line 19 according to (I?]), and 
the subgradient on line 20 according to ([s]). 

n 

Theorem 2. The complexity of calculating the loss and the subgradient with 
Algorithm^ is 0{ms + m\og{m)) for any training set of size m and sparsity s, 
with unrestricted range for utility score values allowed. 

Proof. The cost of computing X'^w on line 1 using standard sparse matrix - 
vector product multiplication algorithms is 0(ms). Initializing the m-length 
arrays on lines 2 and 3 is a 0{m) operation, whereas the empty search tree 
initializations in lines 5 and 12 take 0(1) time. The sorting operation on line 4 
can be done in 0{m \og{'m)) using for example the heapsort algorithm ( [Williams 



1964 1. Tree-Insert on line 9, and Count-Larger on line 11 are both called exactly 
m times, once for each training example. According to Lemmas [3] and [5J both 
operations have 0(log(m)) cost, resulting in 0(mlog(m)) complexity for the 
lines 7—11 altogether. Analogously, lines 14 — 18 have also 0(7TT,log(m)) cost, 

11 



since Tree-Insert on line 16 and the Count-Smaller on line 18 are both called 
exactly m times. Finally, the loss computation on line 19 requires 0{m) floating 
point operations, and the subgradient computation on line 20 requires a Oirns) 
matrix-vector product. Summing all the complexities together, the resulting 
computational complexity of Algorithm w] is 0{ms -t- mlog(?7i)). 

D 

Next, we present a theorem that characterizes the overall complexity of 
RankSVM training using the introduced subgradient and loss computation al- 
gorithm. The theorem and its proof are similar to those presented by [Joachims 



(12006). However, unlike Joachims, we do not (implicitly) assume the number of 
different relevance level to be constant. 

Theorem 3. For any fixed e > and A > 0, the computational cost of linear 
RankSVM training with BMRM (AlgorithmYw using AlgorithmlMfor loss and 
subgradient computations is 0{ms + m\og{m)) for any training set of size m 
and sparsity s, with unrestricted range for utility score values allowed. 



Proof. Theorem 5 in (Smola et al. 20071 states that the BMRM has, under 
minor technical assumptions, 0[-^) speed of convergence to e-accurate solution. 
The convergence speed does not depend on the values of m and s. During 
initialization, the exact value of N can be computed in 0(TOlog(TO)) by sorting 
the true utility scores of the training examples. Further, the only computations 
within each iteration that depend on m and s are the loss and the subgradient 
computations. Therefore, the computational complexity of training RankSVM 
is the same as that of Algorithm [SJ which is according to Theorem [2] 0(tos H- 
mlog(m)). D 

As discussed previously, in some ranking settings we do not have a global 
ranking over all examples. Instead, the training data may be divided into 
separate subsets, over each of which a ranking is defined. Let the training 
data set be divided into R subsets, each consisting on average of ^ examples. 
Then we can calculate the loss and the subgradient as the average over the 
losses and subgradients for each subset. The computational complexity becomes 
0{R * (f s + f log(f )) = 0{ms + ™log(f )). 

5 Computational experiments 

In the computational experiments we compare the scalability of the proposed 
0{ms + r?T,log(TO)) time training algorithm to the fastest previously known ap- 
proach. In addition, we compare our implementation to the existing publicly 
available RankSVM solvers. The considered data sets each contain a single 
global ranking, and the utility scores are real valued. This means that r ^ m, 
and the number of pairwise preferences in the training sets grows quadratically 
with m. In Section |5.1| we describe the experimental setup, and in Section [5. 2| 
we present the experimental results. 
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Figure 1: Average iteration cost. Cadata (left) and Reuters (right). 



5.1 Experimental setup 

We implement the proposed method, denoted as TreeRSVM, as well as a base- 
line method PairRSVM, which iterates over all pairs to calculate the frequencies 
necessary for the loss and subgradient computation. Both methods are based 
on our own implementation of BMRM, and are integrated to the RLScorqJopen 
source machine learning software framework developed by us. The majority of 
the code is written in Python. All matrix operations are implemented in NumPy 
and SciPy, and for solving the quadratic program arising in each BMRM itera- 
tion we use the CVXOPTJjopen source convex optimization software. The most 
computationally demanding parts of the subgradient and loss computations, in- 
cluding the search tree implementation, are written in C language. The only 
difference between the two implementations is in the subgradient computation 
routine. 

In addition, we compare our method to the fastest publicly available previous 
implementations of RankSVM. The SV M''""'' sof t ware is a C-language imple- 

In theory, SVM™"'^ 



mentation of the method described by Joachims (2006). 
and PairRSVM implement exactly the same method. In practice, implemen- 
tational differences such as the use of different quadratic optimizers, and the 
inclusion of certain additional heuristics within S VM'^"" , mean that there may 
be some differences in their behavior. PRSVM implements in MATLAB a trun- 



cated Newton optimization based method for training RankSVM ( Chapelle and 



^ http://www.tucs.fi/rlscore 

^ Ettp : //abel . ee . ucla . edu/cvxopt/ 
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Figure 2: Runtimes for different RankSVM implementations. Cadata (left) and 
Reuters (right). 



Keerthi 2010 



PRSVM optimizes a slightly different objective function than 
the other implementations, since it minimizes a squared version of the pairwise 
hinge loss. Finally, there exists an implementation of RankSVM in the SVM''^''* 
software package. It has however been previous ly shown to be orders of mag- 



nitude slower than either SVM''""" or PRSVM (Joachims 



2006 



Chapelle and 



Keerthi 2010), and therefore we do not include it in our comparison. 

?nlog(?n)) training time complexity, whereas all 
m?) training time complexity. Therefore, 



TreeRSVM has 0{ms 
the other methods have 0{ms 
TreeRSVM should on large datasets scale substantially better than the other im- 
plementations. Further, all the methods other than PRSVM have 0{ms) mem- 
ory complexity due to cost of storing the data matrix. PRSVM has 0{ms + m^) 
memory complexity, since it also forms a sparse data matrix that contains two 
entries per each pairwise preference in the training set. (Chapelle and Keerthi 



2010 1 also describe an improved version of the method which they state to have 
similar scalability as SVM''"" , but there is no publicly available implementation 
of this method. 

The experiments are run on a desktop computer with 2.4 GHz Intel Core 2 
Duo E6600 processor, 8 GB of main memory, and 64-bit Ubuntu Linux 10.10 
operating system. For TreeRSVM, PairRSVM and SVM''""'' we use the ter- 
mination criterion e < 0.001, which is the default setting of SVM''""''. The 
e parameter has exactly the same meaning and scaling for all of these imple- 
mentations. For PRSV M we use the termination cr iterion Newton decrement 
< 10~^, as according to Chapelle and Keerthi (2010) this is roughly equivalent 
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Figure 3: Memory usage for different RankSVM implementations, measured on 
Reuters. 



to the termination criterion we use for the BMRM based methods. SVM''"" 
and PRSVM use a regularization parameter C that is multiplied to the empir- 
ical risk term rather than A, and do not normalize the empirical risk by the 
number of pairwise preferences iV. Therefore, the proper conversion between 
the A and C values is C = ^. 

We run experiments of two publicly available data sets. Cadataj^is a low- 
dimensional data set consisting of approximately approximately 20000 examples, 
each having 8 features. The real valued labels are used directly as utility scores. 



Our second data set is constructed from the Reuters RCVl collection (Lewis 
et al. 2004), and consists of approximately 800000 documents. Here, we use 



a high dimensional feature representation, with each example having approx- 
imately 50000 tf-idf values as features. The data set is sparse, meaning that 
most features are zero- valued. The utility scores are generated as follows. First, 
we remove one target example randomly from the data set. Next, we compute 
the dot products between each example and the target example, and use these 
as utility scores. In effect, the aim is now to learn to rank documents according 
to how similar they are to the target document. 



Similarly to the scalability experiments of Chapelle and Keerthi (2010), we 



compute the running times using a fixed value for the regularization parameter, 
and a sequence of exponentially growing training set sizes. We use A — 10^^ 
for Cadata, and A = 10~^ for Reuters, as these were observed to lead to good 
test performance. The relative differences in running times between the methods 



^ http : //www . csle . ntu . edu . tw/-c j lin/libsvmtools/datasets/ 
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Figure 4: Test error for different RankSVM implementations. Cadata (left) and 
Reuters (right). 



were observed to be similar also for any other tested choices of the regularization 
parameter values, though the absolute runtimes are for all the methods the 
larger the smaller the value of A is. For Cadata we consider the training set 
sizes [1000, 2000, 4000, 8000, 16000]. For Reuters we consider the training set 
sizes [1000, 2000, 4000, 8000, 16000, 32000, 64000, 128000, 256000, 512000]. 

5.2 Experimental results 

In Figure IT] we plot the average time needed for subgradient computation by 
the TreeRSVM and the PairRSVM. It can be seen that the results are consis- 
tent with the computational complexity analysis, the proposed method scales 
much better than the one based on iterating over the pairs of training examples 
in subgradient and loss evaluations. On Reuters with half a million training 
examples PairRSVM already takes 2760 seconds (46 minutes) to finish a single 
iteration, whereas the same is achieved by TreeRSVM in 7 seconds. 

Next, we compare the scalability of the different RankSVM implementations. 
In Figure [2] we present the runtimes of all the different implementations, when 
trained to convergence. As expected, TreeRank achieves orders of magnitude 
faster training times than the other alternatives. PRSVM could not be trained 
beyond 8000 examples due to large memory consumption. On Cadata SVM''""'^ 
showed much worse scalability than should be expected. More detailed study 
of the SVM*""" results revealed that almost all of the runtime was consumed 
by the quadratic solver, which on some iterations failed to make progress for a 



16 



substantial amount of time. The behavior seemed to be caused by numerical 
problems, our implementations which use the CVXOPT solver did not have 
similar issues. On the Reuters data SVM''""''^ did not have such problems, the 
method showed similar scaling as PairRSVM, as expected. With 512000 training 
examples on Reuters training SVM'"" took 83 hours, and training PairRSVM 
took 122 hours, whereas training TreeRSVM took only 18 minutes in the same 
setting. 

In Figure IS] we plot the peak memory usage of the considered implementa- 
tions to give a rough idea about their scalability in terms of memory efficiency. 
We present results only for the Reuters data, since the Cadata has too few train- 
ing examples and features per example to allow reliable benchmarking of the 
methods which have linear scaling in memory usage. PairRSVM is left out of the 
comparison, since it has almost identical memory consumption as TreeRSVM. 
PRSVM has quadratic memory complexity with respect to the training set size, 
and therefore consumes several GB of memory already at 8000 training exam- 
ples. Both TreeRSVM and SVM^"" start to show linear behavior in memory 
complexity once the sample size grows large enough, as expected. At first the 
Python based TreeRSVM implementation is much less memory efficient than 
the C-language SVM''"" implementation, but as the sample size grows the dif- 
ference becomes smaller. For the largest sample sizes TreeRSVM uses roughly 
2.5 times the amount of memory used by SVM'"""'"'. The difference is due to 
the fact that the TreeRSVM implementation maintains two copies of the data 
matrix, one optimized for fast row- and one for fast column access. Better 
memory efficiency could thus be achieved by maintaining only one copy of data 
matrix, but initial experiments showed that this would lead to roughly sevenfold 
increase in training time as measured on Reuters data. 

Finally, in Figure |4] we plot the pairwise ranking errors, for the different im- 
plementations, as measured on independent test sets. For Cadata we use 4000, 
and for Reuters 20000 test examples for computing the test errors. PairRSVM 
is left out of the comparison, since it always reaches exactly the same solution 
as TreeRSVM. These measurements act as a sanity check, showing that de- 
spite the implementational differences TreeRSVM and SVM''"" reach similar 
performance. Further, we see in the results that even though PRSVM opti- 
mizes a squared version of the pairwise hinge loss, it still achieves similar test 
performance as the other methods. 



6 Conclusion 

In this work we have proposed an 0{ms + m\og{m)) time method for training 
RankSVM. Empirical results support the complexity analysis, showing that the 
method scales well to large data sets. The experiments demonstrate orders of 
magnitude improvements in training time on large enough data sets, compared 
to the fastest existing previous implementations. Though we have only con- 
sidered the linear RankSVM, the approach could also be used to speed up its 
kernelized version using a reduced set approximation, such as the one proposed 
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by Joachims and Yu (2009). A possible future research direction would be to 
improve the convergence speed of the BMRM for RankSVM training by devising 
a line search procedure similar to the one proposed by [Franc and Sonnenburg] 



(2009) for SVM classification. 
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